Accelerating sequential programs using FastFlow and self-offloading

Author: Aldinucci Marco
Danelutto Marco
Kilpatrick Peter
Meneghin Massimiliano
Torquati Massimo
Publication date: 12/02/2010

FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow that enables programmers to offload part of their workload onto a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach. Comment: 17 pages + cove

arXiv.org e-Print Archive

UnipiEprints

An Optimization Theory for Structured Stencil-based Parallel Applications

Author: Meneghin Massimiliano
Publication venue: Pisa University Press
Publication date: 14/06/2010

In this thesis, we introduce a new optimization theory for stencil-based applications, centered both on a modification of the well-known owner-computes rule and on basic but powerful properties of toroidal spaces. The proposed optimization techniques provide notable results in different computational aspects: from the reduction of communication overhead to the reduction of computation time, through the minimization of memory requirements without performance loss. All classical optimization theory is based on defining transformations that produce optimized programs which are computationally equivalent to the original ones. According to Kennedy, two programs are equivalent if, from the same input data, they produce identical output data. Like other proposed modifications to the owner-computes rule, we exploit the fact that stencil applications are characterized by a sequence of consecutive steps. For such configurations, it is possible to define specific two-phase optimizations. The first phase applies program transformations which result in the efficient computation of an output that can be easily converted into the original one. In other words, the transformed program defined by the first phase is not computationally equivalent to the original one. The second phase converts the output of the previous phase back into the original one, exploiting optimized techniques in order to introduce the lowest additional overhead. This phase guarantees the computational equivalence of the approach. Obviously, in order to define an interesting new optimization technique, we have to prove that the overall performance of the two-phase sequence is better than that of the original program. Exploiting a structured approach and studying this optimization theory on stencils featuring specific patterns of functional dependencies, we discover a set of novel transformations which result in significant optimizations.
Among the new transformations, the most notable one, which aims to reduce the number of communications necessary to implement a stencil-based application, turns out to be the best optimization technique amongst those cited in the literature. All the improvements provided by the transformations presented in this thesis have been both formally proved and experimentally tested on a heterogeneous set of architectures including clusters and different types of multi-cores.

Electronic Thesis and Dissertation Archive - Università di Pisa

FastFlow: Efficient Parallel Streaming Applications on Multi-core

Author: Aldinucci Marco
Meneghin Massimiliano
Torquati Massimo
Publication date: 02/09/2009

Shared memory multiprocessors have come back to popularity thanks to the rapid spread of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is a current challenge. Little effort has been made to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real world application; the speedup edge of FastFlow over other solutions may be substantial for fine grain tasks, for example +35% over OpenMP, +226% over Cilk, and +96% over TBB for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm. Comment: 23 pages + cove
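The lock-free queues mentioned in the abstract can be illustrated with a minimal single-producer/single-consumer ring buffer. This is only a sketch under stated assumptions: the class name `SpscQueue`, its `push`/`pop` interface, and the use of C++11 acquire/release atomics are hypothetical choices for illustration and are not FastFlow's actual queue, which the abstracts describe as fence-free.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical bounded single-producer/single-consumer (SPSC) queue.
// Illustrative only: NOT FastFlow's implementation. One slot is kept
// empty to distinguish "full" from "empty", so Capacity - 1 items fit.
template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    // Must be called by the producer thread only.
    // Returns false (without blocking) when the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buffer_[tail] = value;
        // Publish the slot: the consumer sees the value before the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Must be called by the consumer thread only.
    // Returns false (without blocking) when the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = buffer_[head];
        // Release the slot back to the producer.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity];
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

Because each index is written by exactly one thread, neither side ever takes a lock; a producer that finds the queue full, or a consumer that finds it empty, simply retries.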

arXiv.org e-Print Archive

UnipiEprints

Minimizing Communications with Q-transformations in Uniform and Affine Stencils

Author: Meneghin Massimiliano
Vanneschi Marco
Publication venue: Università di Pisa
Publication date: 07/10/2009

In stencil-based parallel applications, communications represent the main overhead, especially when targeting a fine grain parallelization in order to reduce the completion time. Techniques that minimize the number and the impact of communications are clearly relevant. In the literature, the best optimization reduces the number of communications per step from 3^dim, featured by a naive implementation, to 2*dim, where dim is the number of domain dimensions. To break this bound, in the paper we introduce and formally prove Q-transformations for stencils featuring data dependencies that can be expressed as geometric affine translations. Q-transformations, based on orienting data dependencies through space translations, lower the number of communications per step to dim.
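To make the gap between the three bounds concrete, the sketch below tabulates them for a given number of dimensions. It assumes the naive count is 3 raised to the power dim (the exponent reading of the figure quoted in the abstract); the struct `CommCounts` and the function `comms_per_step` are hypothetical names introduced for illustration only.

```cpp
#include <cstdint>

// Communications per step under the three schemes named in the abstract.
// Assumption: the naive bound is read as 3^dim.
struct CommCounts {
    std::uint64_t naive;          // 3^dim (assumed reading)
    std::uint64_t best_known;     // 2 * dim
    std::uint64_t q_transformed;  // dim
};

// Hypothetical helper: computes the three counts for `dim` dimensions.
CommCounts comms_per_step(unsigned dim) {
    std::uint64_t naive = 1;
    for (unsigned i = 0; i < dim; ++i) {
        naive *= 3;  // integer power, avoids floating-point pow()
    }
    return CommCounts{naive, 2ull * dim, static_cast<std::uint64_t>(dim)};
}
```

For a three-dimensional domain this gives 27, 6, and 3 communications per step respectively, so the saving grows quickly with the number of dimensions.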

UnipiEprints

Efficient Smith-Waterman on multi-core with FastFlow

Author: Marco Aldinucci
Massimiliano Meneghin
Massimo Torquati
Publication date: 01/01/2010

Shared memory multiprocessors have returned to popularity thanks to the rapid spread of commodity multi-core architectures. However, little attention has been paid to supporting effective streaming applications on these architectures. In this paper we describe FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than them on a given real world application: the speedup of FastFlow over other solutions may be substantial for fine grain tasks, for example +35% over OpenMP, +226% over Cilk, +96% over TBB for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm.

CiteSeerX

Crossref

Archivio della Ricerca - Università di Pisa

Neon: A Multi-GPU Programming Model for Grid-based Computations

Author: Jayaraman Pradeep Kumar
Mahmoud Ahmed H.
Meneghin Massimiliano
Morris Nigel J. W.
Publication venue: eScholarship, University of California
Publication date: 01/05/2022

We present Neon, a new programming model for grid-based computation with an intuitive, easy-to-use interface that allows domain experts to take full advantage of single-node multi-GPU systems. Neon decouples data structure from computation and back end configurations, allowing the same user code to operate on a variety of data structures and devices. Neon relies on a set of hierarchical abstractions that allow the user to write their applications as if they were sequential applications, while the runtime handles distribution across multiple GPUs and performs optimizations such as overlapping computation and communication without user intervention. We evaluate our programming model on several applications: a Lattice Boltzmann fluid solver, a finite-difference Poisson solver and a finite-element linear elastic solver. We show that these applications can be implemented concisely and scale well with the number of GPUs—achieving more than 99% of ideal efficiency

eScholarship - University of California